ABSTRACT

We develop from first principles an exact model of the behavior
of loop nests executing in a memory hierarchy, by using a
nontraditional classification of misses that has the key property of
composability. We use Presburger formulas to express various kinds of
misses as well as the state of the cache at the end of the loop nest.
We use existing tools to simplify these formulas and to count cache
misses. The model is powerful enough to handle imperfect loop
nests and various flavors of non-linear array layouts based on bit
interleaving of array indices. We also indicate how to handle modest
levels of associativity, and how to perform limited symbolic analysis
of cache behavior. The complexity of the formulas relates to the
static structure of the loop nest rather than to its dynamic trip count,
allowing our model to gain efficiency in counting cache misses by
exploiting repetitive patterns of cache behavior. Validation against
cache simulation confirms the exactness of our formulation. Our
method can serve as the basis for a static performance predictor to
guide program and data transformations to improve performance.

1.  INTRODUCTION 

The growing gap between processor cycle time and main memory
access time makes efficient use of the memory hierarchy ever
more important for performance-oriented programs. Many computations
running on modern machines are limited by the response of the
memory system rather than by the speed of the processor. Caches are
an architectural mechanism designed to bridge this speed gap, by
satisfying the majority of memory accesses with low latency and at
close to processor speed. However, programs must exhibit good
locality of reference in their memory access sequences in order to
realize the performance benefit of caches.

Optimizing compilers attempt to speed up programs by performing
semantics-preserving code transformations. Loop transformations
such as iteration space tiling [62] are a major source of performance
benefits. They restructure loop iterations in ways that
make the memory reference sequence more cache-friendly. The
theory of loop transformations is well-developed in terms of deciding
the legality of a proposed transformation and generating code
for the transformed loop. However, models of the expected performance
gains of performing a given loop transformation are less
well-developed [19, 38, 45, 48, 50, 51, 61]. Where such models
exist, they are often heuristic or approximate. For example, tiling
requires the choice of tile sizes, and the performance of a loop nest
is typically a non-smooth function of the extents of the loop bounds,
the tile sizes, and the cache parameters [13, 19, 38]. The model we
develop in this paper can be used to quantitatively determine the
number of cache misses of a proposed transformation without
explicit simulation. Ultimately, such a model could be used to guide
the choice of parameters in such program transformations.

A complementary method for improving sequential program performance
that has been investigated in recent years is that of transforming
the memory layout of its data structures. Such data layout
transformations can vary in complexity; examples include transposition
and stride reordering [32], array merging [39], intra- and
inter-array padding [50, 51], data copying [38], and non-linear array
layouts [14]. Once again, proper choice of parameter values is
of paramount importance in getting good performance out of such
transformations, but the models guiding this optimization are often
inexact. For instance, Rivera and Tseng [50, 51] use heuristics
to determine inter-array pad. However, there is empirical evidence
that almost every choice of pad can be catastrophically bad for a
program as simple as matrix transposition [16]. Better models are
clearly needed to guide such optimizations. Our work in this paper
is a step in this direction.

An aggressive form of data optimization is the use of certain
families of non-linear array layouts that are based on interleaving
the bits in the binary expansion of the row and column indices of
arrays. Previous studies have demonstrated performance gains as
well as robustness of performance resulting from the use of such
layouts [14, 15]. Yet it is difficult to ascertain, short of simulation,
the memory behavior of a program given a particular data layout.
This paper works towards building an analytical model of cache
behavior for such layouts that can provide insight into the relationship
between such data layouts and memory behavior.

Our model is an alternative to the well-known Cache Miss Equations
(CME) model of Ghosh et al. [26]. Compared to CME, our
model has the following strengths and weaknesses.

• Our model is exact as a consequence of our use of Presburger
arithmetic as the underlying formalism. Ghosh et al. [26]
use the abstraction of reuse vectors to simplify the analysis.
Reuse vectors do not exist for all loop nests, and certainly do
not exist in the presence of non-linear array layouts.

•  Our  model  accurately  determines  the  state  of  the  cache  at  the 
end  of  executing  a  loop  nest.  This  functionality  is  important 
for  accurately  counting  compulsory  misses  [30],  for  rapidly 
leap-frogging  up  to  a  certain  point  in  the  computation,  and 
for  handling  multiple  loop  nests. 

• Our model handles imperfect loop nests in addition to perfect
loop nests. We apply a transformation of Ahmed et al. [2, 3]
to an imperfect loop nest, thereby converting it to a perfect
loop nest with guards on statements. Ghosh et al. [26] consider
only a single perfect loop nest.

• Our model handles a variety of array layout functions, from
row- and column-major to non-linear. We will subsequently
refer to row- and column-major layouts as canonical layouts
[17]. The formulation for non-linear layouts is new, to
the best of our knowledge.

• Our model handles caches with modest levels of associativity
in a natural way. While Ghosh et al. [25] can handle
set-associative caches, their solution method is equivalent to
simulation in the worst case.

• Our model is capable of symbolic analysis. This is a direct
consequence of our use of the Presburger formalism. For example,
we can simplify a formula for the cross-interference
between two arrays while keeping the difference of their starting
addresses symbolic. The simplified formula can be rapidly
evaluated for specific values of this variable.

• The enhanced capabilities of our model come at the cost of
computational complexity, in the form of super-exponential
worst case behavior of algorithms for satisfiability checking
and quantifier elimination of Presburger formulas [60].
While we have a prototype implementation of our model as
a SUIF [55] pass, and the analysis and formula generation
portions of the implementation are acceptably efficient,
significant improvements are necessary to the robustness and
efficiency of the simplification and counting parts.

Compared to explicit simulation, our formulas capture temporal
patterns of cache behavior that may not be apparent in simulation.
Moreover, an analytical cache model provides deeper insight into
the behavior than what may be learned from simulation. We anticipate
that such information will make it possible to guide the choice
of data layouts that optimize cache behavior. We validate the results
of all our formulas against simulation in Section 4, thereby
confirming their exactness.

The remainder of this paper is structured as follows. Section 2
reviews background material for discussing our approach to the cache
analysis problem: basics of cache memory (Section 2.1), a new
classification of cache misses (Section 2.2), the polyhedral model
(Section 2.3), and Presburger arithmetic (Section 2.4). Section 3
constructs our model. Section 4 provides some preliminary results
obtained using our cache analysis model. Section 5 discusses related
work. Section 6 presents conclusions and future work.

2.  BACKGROUND 

This  section  provides  background  material  and  defines  notation 
for  the  remainder  of  the  paper. 


2.1  Basics  of  memory  hierarchies 

We  assume  a  simplified  memory  hierarchy  that  processes  one 
memory  access  at  a  time,  with  no  distinction  between  memory 
reads  and  writes. 

2.1.1  Cache  structure 

The structure of a single level of a memory hierarchy (a cache)
is generally characterized by three parameters [30]: Associativity,
Block size, and Capacity. Capacity and block size are in units of the
minimum memory access size (usually one byte). A cache can hold
a maximum of $C$ bytes. However, due to physical constraints, the
cache is divided into cache frames of size $B$ that contain $B$
contiguous bytes of memory, called a memory block. The associativity $A$
specifies the number of different frames in which a memory block
can reside. If a block can reside in any frame (i.e., $A = C/B$), the
cache is said to be fully associative; if $A = 1$, the cache is
direct-mapped; otherwise, the cache is $A$-way set associative. A cache
set is the group of frames in which a memory block can reside, and
the number of cache sets, $S$, is given by $S = C/(AB)$.

We assume a two-level memory hierarchy, consisting of an
$A$-way set associative cache with block size of $B$ bytes and total
capacity of $C$ bytes followed by main memory. We also assume that
main memory is large enough to hold all the data referenced by
the program. The function $\mathcal{B}$ converts a memory byte address into
a memory block address (with $\mathcal{B}(a) = \lfloor a/B \rfloor$). The function $\mathcal{S}$
converts a memory block address to the cache set to which it maps
(thus, $\mathcal{S}(b) = b \bmod S$).

2.1.2  Cache  dynamics 

For an access to memory address $m$, the cache controller
determines whether memory block $\mathcal{B}(m)$ is resident in any of the $A$
cache frames in cache set $\mathcal{S}(\mathcal{B}(m))$. If the memory block is
resident, a cache hit is said to occur, and the cache satisfies the access
after its access latency. If the memory block is not resident, a cache
miss is said to occur.

The state of the cache represents the memory block(s) contained
in each set of the cache at any point during a program's execution.
Thus, in a direct-mapped cache where each set holds one frame,
the cache state $\mathcal{C}$ maps set $s$ to the address of the memory block
contained there. In general, $\mathcal{C}$ is a map from cache sets to the sets
of memory blocks that they contain. $\mathcal{C}(s)$ is empty for a cache set
$s$ to which no block has been mapped.

2.2  Classification  of  cache  misses 

From an architectural standpoint, cache misses fall into one of
three classes: compulsory, capacity, and conflict [30]. Capacity and
conflict misses are often combined and called replacement misses.
This classification is extremely useful for understanding the role of
capacity and associativity in the performance of a cache; however,
it does not have the property of composability.

Consider two program fragments $P_1$ and $P_2$, where, for
$i \in \{1, 2\}$, fragment $P_i$ incurs $C_i$ cold misses and $R_i$ replacement
misses. Now consider the program fragment
$P_{12} \stackrel{\mathrm{def}}{=} P_1; P_2$ formed
by sequential composition of $P_1$ and $P_2$, and suppose that it incurs
$C_{12}$ cold misses and $R_{12}$ replacement misses. There is no simple
relation connecting the misses of the whole to the misses of the
parts. In particular, $C_{12} + R_{12} \neq C_1 + R_1 + C_2 + R_2$ in general.
Composition is a fundamental operation in the construction of programs and
in the definition of programming language semantics. As we wish
to count cache misses for individual program fragments and their
compositions, we propose a different classification that is composable.

We classify misses from a program fragment into the following
two classes.

• Interior misses are those data references that are guaranteed
to miss, independent of the initial cache state when the fragment
begins execution. In other words, given the code, the array
layouts, and the structural parameters of the cache, such
misses can be identified, enumerated, and counted by analyzing
the fragment in isolation.

• Potential boundary misses are those data references that may
either hit or miss, depending on the initial cache state when
the fragment begins execution. The potential occurrence of
such misses can be identified by analyzing the fragment in
isolation, but the actual occurrence of the miss can be determined
only after considering the initial cache state.

Another equivalent view of this classification is that we can statically
examine a program fragment in isolation and place each data
memory access that it makes into one of three categories: those that
are guaranteed to hit, those that are guaranteed to miss (interior
misses), and those that could hit or miss depending on the initial
cache state (potential boundary misses). In a second step, we further
partition the potential boundary misses into hits and misses by
resolving them against the cache state when the program fragment
starts executing. We call the accesses that are resolved as misses
boundary misses. It follows that, in order to compose program
fragments, we also need to determine the state of the cache after
executing a program fragment. For a given program fragment $P$ and
an initial cache state $\mathcal{C}$, we will let $\Psi(P, \mathcal{C})$ denote the final
cache state after fragment $P$ has completed execution.

Theorem 2.1 Let program fragment $P_1$ executing from initial cache
state $\mathcal{C}_0$ incur $I_1$ interior misses and $B_1(\mathcal{C}_0)$ boundary misses and
produce final cache state $\mathcal{C}_1 = \Psi(P_1, \mathcal{C}_0)$. Let program
fragment $P_2$ executing from initial cache state $\mathcal{C}_1$ incur $I_2$ interior
misses and $B_2(\mathcal{C}_1)$ boundary misses and produce final cache state
$\mathcal{C}_2 = \Psi(P_2, \mathcal{C}_1)$. Let program fragment
$P_{12} \stackrel{\mathrm{def}}{=} P_1; P_2$ executing
from initial cache state $\mathcal{C}_0$ incur $I_{12}$ interior misses and $B_{12}(\mathcal{C}_0)$
boundary misses and produce final cache state $\mathcal{C}_{12} = \Psi(P_{12}, \mathcal{C}_0)$.
Then the following relations hold.

$$I_{12} + B_{12}(\mathcal{C}_0) = I_1 + B_1(\mathcal{C}_0) + I_2 + B_2(\mathcal{C}_1)$$

$$\mathcal{C}_{12} = \mathcal{C}_2$$

PROOF. The proof follows immediately from the semantics of
program composition and from the deterministic nature of the
program fragments and of the cache. □

Theorem  2.1  has  several  important  consequences. 

• The theorem enables the analysis of cache misses of a composite
program fragment in terms of the cache miss behavior
of its parts. Each part can be analyzed in isolation, and the
results of these analyses can be combined using cache states.
We will show later how to efficiently propagate cache state
across a program fragment.

• Stronger assertions, like $I_{12} = I_1 + I_2$, do not hold in general.

• The theorem is silent about the nature of program fragments
$P_1$ and $P_2$ or about how to calculate boundary and interior
misses for them. In the remainder of the paper, we will
choose loop nests as our atomic program fragments and use
Presburger formulas to codify the various kinds of misses.

• The theorem provides additional leverage if symbolic analysis
of the atomic program fragments is possible. For example,
block-recursive codes [4] employ multiple dynamic
instances of the same loop nest differing only in the starting
addresses of the data arrays on which they operate. Symbolic
analysis of such fragments would allow the cost of analysis
to be amortized over multiple uses of the resulting formulas.

• Note that boundary misses for a fragment are bounded from
above by the cache footprint of the data structures it accesses,
which is in turn bounded from above by the number of cache
frames. This number is typically much smaller than the number
of interior misses. We could therefore avoid the calculation
of cache state and approximate the number of cache
misses of the composite program by $I_1 + I_2$, with an
accompanying error bound.
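
The bookkeeping behind Theorem 2.1 can be checked on small address
traces. The following is a minimal sketch, assuming a direct-mapped
cache and assuming that a program fragment is represented simply as a
list of byte addresses. The function name `run_fragment`, the cache
parameters, and the trace values are illustrative and are not taken
from the paper.

```python
def run_fragment(trace, initial_state, num_sets, block_size):
    """Simulate a direct-mapped cache over one fragment.

    Returns (interior_misses, boundary_misses, final_state), where a state
    is a dict mapping a cache set to the memory block it holds.
    """
    state = dict(initial_state)
    touched = set()           # sets already accessed inside this fragment
    interior = boundary = 0
    for addr in trace:
        block = addr // block_size
        s = block % num_sets
        if s in touched:
            # Outcome is fixed by the fragment alone: hit or interior miss.
            if state[s] != block:
                interior += 1
        else:
            # First access to this set: outcome depends on the initial state.
            if state.get(s) != block:
                boundary += 1
            touched.add(s)
        state[s] = block
    return interior, boundary, state


# Hypothetical traces for P1 and P2 (byte addresses), cache initially empty.
p1, p2 = [0, 4, 16, 0], [16, 32, 4]
i1, b1, c1 = run_fragment(p1, {}, num_sets=4, block_size=4)
i2, b2, c2 = run_fragment(p2, c1, num_sets=4, block_size=4)
i12, b12, c12 = run_fragment(p1 + p2, {}, num_sets=4, block_size=4)
```

On these traces the miss totals agree and the final states coincide,
while the interior miss count of the composition (4) differs from the
sum of the interior miss counts of the parts (2 + 1).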

2.3  The  polyhedral  model 

Our model for analyzing cache behavior of loop nests is based
on the well-known polyhedral model [20]. The program fragment
whose cache behavior we are trying to analyze is a nested normalized
loop with $d$ levels of nesting, numbered $0$ through $d-1$ from
outermost to innermost. We first consider perfect loop nests; we
will extend the model to imperfect loop nests in Section 3.4. The
upper bound $U_j$ of $\ell_j$, the loop control variable (LCV) for loop $j$, is
an affine function of the LCVs $\ell_0$ through $\ell_{j-1}$. The iteration space
$\mathcal{I}$ is the set of all valid combinations of LCV values that are within
the bounds of the loop nest. The notation $\ell = [\ell_0, \ldots, \ell_{d-1}]^T$
denotes a generic point in the iteration space $\mathcal{I}$. The iteration space
possesses a total order $\prec$, which in the polyhedral model is the
lexicographic ordering. The order specifies the temporal order in
which the iteration points in the iteration space are executed.

The loop accesses elements of arrays $Y^{(0)}$ through $Y^{(m-1)}$.
Array variable $Y^{(i)}$ has $d_i$ dimensions, with $n_j$ being the extent of the
array in the $(j+1)$th dimension. The data index space $\mathcal{D}_i$
corresponding to array $Y^{(i)}$ is the Cartesian product
$[0, n_0 - 1] \times \cdots \times [0, n_{d_i - 1} - 1]$.

The statements in the loop body make $k$ references to array
variables. The $i$th reference $R_i$ has three components: $N_i$, the name of
the array referenced (so that $N_i = Y^{(j)}$ for some $j \in [0, m-1]$);
$F_i$, the index expression of the reference, which identifies the
coordinates of the array element accessed by this reference at iteration
point $\ell$; and $S_h$, the statement that contains reference $R_i$. To
include statement $S_h$ in the definition of reference $R_i$ may seem
excessive at this point, but it will be useful in Section 3.4 when
we consider imperfect loop nests. The index expression $F_i$ is
constrained to be an affine function of $\ell$ in each of its components.
Thus, $F_i$ is a function from the iteration space $\mathcal{I}$ to the data index
space $\mathcal{D}_{N_i}$.

Borrowing terminology from Ghosh et al. [26], we call a static
instance of a memory read or write a reference, and a dynamic
instance of that read or write an access. A reference and an iteration
point uniquely define an access. The total order $\prec$ on iterations
almost induces a similar total order on accesses; however, two
accesses in the same iteration need to be ordered as well. We compose
the total order $\prec$ on the iteration space and the order among
references of an iteration to define a total order "precedes" (written $\lhd$)
among accesses. Thus, access $(R_i, u)$ precedes access $(R_j, v)$ iff
$(u \prec v) \lor (u = v \land i < j)$.

Several quantities are associated with array $Y^{(i)}$: a layout function
$\mathcal{L}_i$, which is a 1-1 map from $\mathcal{D}_i$ into the memory address space
$\mathbb{Z}^{+}$; $\mu_i$, the starting byte address of the array; and $\beta_i$, the number
of bytes per array element. Applying $\mathcal{L}_i$ to an element of the array
produces an offset, and multiplying the offset by $\beta_i$ gives the byte
offset from the starting address of the array in memory. Adding this
offset to $\mu_i$ then gives the byte address of the element.

Putting all of this notation together, we have the objects of interest
and their mathematical representations shown in Table 1.

Object                                       Mathematical representation
-------------------------------------------  ------------------------------------------------------
An iteration point                           $\ell$
$i$th array reference                        $R_i = (Y^{(j)}, F_i, S_h)$
Access made by $R_i$ at $\ell$               $(R_i, \ell)$
Array element accessed by $R_i$ at $\ell$    $e_i = Y^{(j)}[F_i(\ell)]$
Byte address of $e_i$                        $m_i = \mu_j + \mathcal{L}_j(F_i(\ell)) \cdot \beta_j$
Block address of $m_i$                       $b_i = \mathcal{B}(m_i)$
Cache set to which $b_i$ maps                $s_i = \mathcal{S}(b_i)$

Table 1: Table of notation.

Example 1 Consider the following loop nest for matrix multiplication,
which we present in a stylized pseudo-code in an attempt to
remain language-neutral.

do i = 0, n-1
  do j = 0, n-1
    do k = 0, n-1
S0:   C[i,j] = A[i,k]*B[k,j]+C[i,j]
    end
  end
end

This loop nest has depth $d = 3$. The LCVs are $\ell_0 = i$, $\ell_1 = j$,
and $\ell_2 = k$. The loop nest accesses three arrays: $Y^{(0)} = A$,
$Y^{(1)} = B$, and $Y^{(2)} = C$. Each array is two-dimensional, so
that $\mathcal{D}_0 = \mathcal{D}_1 = \mathcal{D}_2 = [0, n-1] \times [0, n-1]$. There are four
array references: $R_0 = A[i,k]$, $R_1 = B[k,j]$, $R_2 = C[i,j]$
(the read access), and $R_3 = C[i,j]$ (the write access). The index
expressions of the four references are

$$F_0 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad
F_1 = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}, \quad
F_2 = F_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}.$$

All references are contained in statement $S_0$.

2.4 Presburger arithmetic

Presburger arithmetic [31] is the subset of first order logic
corresponding to the theory of integers with addition. Presburger
formulas consist of affine constraints on integer variables, which can
be either constraints of equality or inequality. The constraints are
linked by the logical operators $\neg$, $\land$ and $\lor$, and the quantifiers
$\forall$ and $\exists$. It has been used to model various aspects of programming
languages, as well as in other areas such as timing verification [6,
7]. We use Presburger formulas to define polytopes whose contents
describe interesting events like cache misses.

Presburger arithmetic is decidable; however, a quantifier elimination
decision procedure has a superexponential upper bound on
performance. More precisely, the truth of a sentence of length $n$
can be determined within $2^{2^{2^{pn}}}$ time, for some constant $p > 1$
[46]. The bound is tight [60]. Bounded quantifier elimination has
worst-case upper and lower bounds of $O(2^{2^{n}})$ [60]. The complexity
is related to the number of alternating blocks of $\forall$ and $\exists$
quantifiers [52] as well as to the numerical values of the integer constants
and their co-primality relationships.

We use the Omega library [34] to manipulate and simplify our
Presburger formulas, and have found its methods reasonably efficient
for our applications.

3. THE CACHE ANALYSIS MODEL

The problem of central interest to us is the following.

Given a cache configuration as in Section 2.1, a loop
nest $L$ meeting the conditions of Section 2.3, the layout
functions of the arrays accessed in $L$, and an initial
cache state $\mathcal{C}_{in}$:

• count the interior misses incurred by $L$;

• count the boundary misses incurred by $L$;

• find the cache state $\mathcal{C}_{out}$ after execution of $L$.

A simple strategy to accomplish all of these goals is through
simulation of the code. This is precisely what cache simulators [29,
40, 54, 56] do. The main drawback of simulation is its slowness:
it takes time proportional to the running time of the code, usually
with a significant multiplicative factor (10 to 100 is typical). In the
matrix multiplication kernel of Example 1, this time is $O(n^3)$. Our
goal is to develop much faster algorithms, whose existence is
suggested by the regularity of the array access patterns and the limited
number of cache sets to which they map.
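
To make the cost of the simulation strategy concrete, the following is
a minimal sketch of a trace-driven simulation of the matrix
multiplication kernel of Example 1. It assumes a direct-mapped cache
that starts empty, row-major arrays with 8-byte elements, and arrays
A, B, and C placed back to back in memory. These choices and the name
`simulate_matmul` are illustrative assumptions, not the simulators
cited above.

```python
def simulate_matmul(n, capacity, block_size, elem_size=8):
    """Count misses of the Example 1 loop nest on a direct-mapped cache.

    Returns (misses, accesses). The work is proportional to n**3, which
    is exactly the drawback of simulation discussed in the text.
    """
    num_sets = capacity // block_size
    array_bytes = n * n * elem_size
    # Assumed placement: A, B, C laid out contiguously, starting at 0.
    base = {"A": 0, "B": array_bytes, "C": 2 * array_bytes}

    def addr(name, row, col):
        return base[name] + (row * n + col) * elem_size  # row-major layout

    cache = [None] * num_sets  # cache[s] = block resident in set s
    misses = accesses = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # References in program order: A[i,k], B[k,j],
                # C[i,j] (read), C[i,j] (write).
                for a in (addr("A", i, k), addr("B", k, j),
                          addr("C", i, j), addr("C", i, j)):
                    block = a // block_size
                    s = block % num_sets
                    accesses += 1
                    if cache[s] != block:
                        misses += 1
                        cache[s] = block
    return misses, accesses
```

When the cache is large enough that no two blocks of the three arrays
share a set, every miss is a first touch of a block, so the miss count
equals the number of distinct blocks; smaller caches can only add
misses to that floor.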

Section 3.1 provides the basic Presburger formulas necessary to
describe the cache events in Section 3.2. Section 3.3 discusses how
we count cache misses, given such Presburger formulas. Section
3.4 extends our model to analyze imperfect loop nests. Section
3.5 shows how to extend our formula for interior misses to handle
modest levels of associativity. Section 3.6 reviews array layouts
based on bit interleaving, and provides the Presburger formulas to
describe them. Section 3.7 discusses issues related to physically
indexed caches.

3.1 Describing cache structure using Presburger formulas

We now present the basic formulas that will be combined in
Section 3.2 to describe cache events. The translations are mostly
straightforward or well-known [18, 49].

3.1.1  Valid  iteration  point 

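The predicate $\ell \in \mathcal{I}$ describes the fact that iteration point
$\ell = [\ell_0, \ldots, \ell_{d-1}]^T$ belongs to the iteration space.

$$\ell \in \mathcal{I} \;\stackrel{\mathrm{def}}{=}\; \bigwedge_{i=0}^{d-1} 0 \le \ell_i < U_i \qquad (1)$$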

3.1.2  Lexicographical  ordering  of  accesses 

When considering all accesses that occur before access $(R_v, m)$,
we include any access occurring at an iteration $\ell$ such that
$\ell \prec m$. To be complete, we must also include any access made at
iteration $m$ by a reference that occurs before $R_v$. The predicate
$(R_u, \ell) \lhd (R_v, m)$ describes the fact that the memory access made
by reference $R_u$ at iteration $\ell$ precedes the memory access made by
$R_v$ at $m$.

$$(R_u, \ell) \lhd (R_v, m) \;\stackrel{\mathrm{def}}{=}\; \ell \in \mathcal{I} \land m \in \mathcal{I} \land
\Big( \bigvee_{i=0}^{d-1} \big( \ell_i < m_i \land \bigwedge_{j=0}^{i-1} \ell_j = m_j \big) \lor
\big( \bigwedge_{j=0}^{d-1} \ell_j = m_j \land u < v \big) \Big) \qquad (2)$$

3.1.3  Mapping  memory  locations  to  cache  sets 
Let $A$ = associativity, $B$ = block size, $C$ = capacity, and
$S = C/(AB)$ = number of cache sets. Then memory location $m$ maps to
cache set $s = \lfloor m/B \rfloor \bmod S$. This can be translated to the following
Presburger formula, where the auxiliary variable $w$ represents the
"cache wraparound". Suppose that $Y^{(x)}$ is the array referencing
memory location $m$, and let $a_x$ be the number of elements in $Y^{(x)}$.

$$\mathit{Map}(m, w, s) \;\stackrel{\mathrm{def}}{=}\; 0 \le s < S \;\land\;
B(wS + s) \le m < B(wS + s) + B \;\land\;
\mu_x - B < B(wS + s) < \mu_x + \beta_x a_x \qquad (3)$$

The last clause in formula (3) bounds the possible values of $w$,
and is used to bound certain directions of the underlying polytope
that would otherwise be unconstrained. This bounding is needed
for efficiency in the counting step that follows formula simplification.
The quantity $B(wS + s)$ represents the address of the first
byte in the block containing memory location $m$, which must be
within the memory locations containing array $Y^{(x)}$. However, if
the starting address $\mu_x$ is not aligned on a memory block boundary,
asserting that $\mu_x \le B(wS + s)$ is wrong. As shown below,
the address of the first byte in the memory block containing $Y^{(x)}$'s
first element may actually be less than $\mu_x$. Restricting $w$ such that
$\mu_x - B < B(wS + s)$ is correct whether or not the starting address
$\mu_x$ is aligned on a memory block boundary.

[Figure: number line marking $B(wS+s)$, $m$, and $B(wS+s)+B$.]
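
The role of the wraparound variable and of the lower bound in formula
(3) can be seen in a small executable sketch. It assumes byte
addresses are non-negative integers; the function names and the
parameter names `mu`, `beta`, and `count` (standing for the array's
starting address, element size, and element count) are illustrative.

```python
def map_address(m, block_size, num_sets):
    """Return (w, s): the wraparound count and cache set of byte address m."""
    block = m // block_size
    return block // num_sets, block % num_sets


def satisfies_map(m, w, s, block_size, num_sets, mu, beta, count):
    """Evaluate the three clauses of formula (3) for an array that starts
    at byte address mu and holds `count` elements of `beta` bytes each."""
    first_byte = block_size * (w * num_sets + s)  # start of m's block
    return (0 <= s < num_sets
            and first_byte <= m < first_byte + block_size
            and mu - block_size < first_byte < mu + beta * count)
```

For an array starting at the unaligned address 10 with 8-byte blocks,
the block holding the first element starts at address 8, which is
below the array's starting address but still above the starting
address minus the block size, as the relaxed lower bound requires.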


3.1.4  Data  layouts  in  memory 
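Row- and column-major layouts are easily expressed using Presburger
formulas. Consider reference $R_u = (Y^{(x)}, F_u, S_h)$ and
iteration point $\ell$. Let $F_u(\ell) = [i_0, \ldots, i_{d_x-1}]^T$.

$$(m = \mathit{Row\text{-}maj}(F_u(\ell), \mu_x)) \;\stackrel{\mathrm{def}}{=}\; m \ge 0 \;\land\;
m = \mu_x + \Big( \sum_{j=0}^{d_x-2} \Big( \prod_{k=j+1}^{d_x-1} n_k \Big) i_j + i_{d_x-1} \Big) \beta_x \qquad (4)$$

$$(m = \mathit{Col\text{-}maj}(F_u(\ell), \mu_x)) \;\stackrel{\mathrm{def}}{=}\; m \ge 0 \;\land\;
m = \mu_x + \Big( i_0 + \sum_{j=1}^{d_x-1} \Big( \prod_{k=0}^{j-1} n_k \Big) i_j \Big) \beta_x \qquad (5)$$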

Section  3.6  discusses  nonlinear  data  layouts. 
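
As a concrete reading of formulas (4) and (5), the following sketch
computes the byte address of an array element under each canonical
layout. It is an illustration only; the function and parameter names
are assumptions, with `extents` holding the array extents, `mu` the
starting byte address, and `beta` the element size in bytes.

```python
from math import prod


def row_major(index, extents, mu, beta):
    """Byte address of element `index` under row-major layout, formula (4)."""
    d = len(extents)
    # Index j is weighted by the product of all extents to its right.
    offset = sum(prod(extents[j + 1:]) * index[j] for j in range(d - 1))
    offset += index[d - 1]
    return mu + offset * beta


def col_major(index, extents, mu, beta):
    """Byte address of element `index` under column-major layout, formula (5)."""
    d = len(extents)
    # Index j is weighted by the product of all extents to its left.
    offset = index[0] + sum(prod(extents[:j]) * index[j] for j in range(1, d))
    return mu + offset * beta
```

For a 3 x 4 array of 8-byte elements starting at address 100, element
(1, 2) sits at offset 6 in row-major order and offset 7 in
column-major order, giving byte addresses 148 and 156 respectively.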

3.2 Describing cache behavior using Presburger formulas

The various pieces described in Section 3.1 fit together to
describe events in the cache. We now construct Presburger formulas
for interior misses, boundary misses, and cache state, as defined in
Sections 2.1 and 2.2. We consider direct-mapped caches for now,
and extend the formulation to set-associative caches in Section 3.5.

3.2.1 Interior misses

To identify a cache miss, Ghosh et al. [26] rely on the notion of
a most recent access of a memory block, which they obtain through
reuse vectors. This abstraction is valid when the array index
expressions are uniformly generated in addition to being affine in the
LCVs. We avoid this condition by dispensing with the notion of a
most recent access in our formulas.

To determine if an access to a memory block $b$ results in an interior
miss, it is enough to know two things: that there is an earlier access
to a different memory block mapping to the same cache set as $b$; and
that there is no access to $b$ between this earlier access and the
current access to $b$. Let reference $R_u = (Y^{(x)}, F_u, S_p)$ at
iteration point $i$ access memory block $b_u$, and let reference
$R_v = (Y^{(y)}, F_v, S_q)$ at iteration point $j$ access memory block
$b_v$. Suppose that access $(R_v, j)$ precedes access $(R_u, i)$,
recalling the “precedes” relation from Section 3.1.2; that $b_u$ and
$b_v$ are distinct memory blocks; but that both $b_u$ and $b_v$ map to
the same cache set $s$. Then, access $(R_u, i)$ suffers an interior
miss if there does not exist a reference $R_w = (Y^{(z)}, F_w, S_r)$ at
iteration $k$ accessing memory block $b_w$, such that
$(R_v, j) \triangleleft (R_w, k) \triangleleft (R_u, i)$ and
$b_u = b_w$. The following formula expresses this condition.

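$$
\begin{aligned}
((R_u, i) \in \mathit{IntMiss}(L)) \;\stackrel{\mathrm{def}}{=}\;& i \in I \;\wedge \\
& \exists d, s : \mathit{Map}(\mathcal{L}_x(F_u(i)), d, s) \;\wedge \\
& \exists e, j, v : (R_v, j) \triangleleft (R_u, i) \;\wedge\; \mathit{Map}(\mathcal{L}_y(F_v(j)), e, s) \;\wedge \\
& \neg(\exists k, w : (R_v, j) \triangleleft (R_w, k) \triangleleft (R_u, i) \;\wedge\; \mathit{Map}(\mathcal{L}_z(F_w(k)), d, s)) \;\wedge\; d \ne e \qquad (6)
\end{aligned}
$$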

Note that it is not necessary to have $Y^{(x)} = Y^{(z)}$ in order to
have $(R_u, i)$ and $(R_w, k)$ access the same memory block. This
flexibility accommodates the possibility of array aliasing.

3.2.2  Boundary  misses 

Recall that boundary misses are those that are dependent on the
initial cache state. Therefore, we are interested only in those
accesses that are the first to map to a cache set during the execution
of the loop nest. For all other accesses, the cache set already
contains a memory block accessed during the execution of the loop
nest, and initial cache state is irrelevant. To determine an actual
boundary miss for an access that is the first to map to the cache set,
it simply remains to check if the memory block accessed is resident in
the initial cache state of the set.

An access $(R_u = (Y^{(x)}, F_u, S_p), i)$ to memory block $b_u$
suffers a boundary miss if there does not exist an access $(R_v, j)$
preceding $(R_u, i)$ and accessing a memory block $b_v$ mapping to the
same cache set $s$, and $b_u$ is not in the initial cache state
$C_{in}$ at set $s$. Note that, unlike in the formula for interior
misses, there is no constraint $b_u \ne b_v$.

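$$
\begin{aligned}
((R_u, i) \in \mathit{BoundMiss}(L, C_{in})) \;\stackrel{\mathrm{def}}{=}\;& i \in I \;\wedge \\
& \exists d, s : \mathit{Map}(\mathcal{L}_x(F_u(i)), d, s) \;\wedge \\
& \neg(\exists e, j, v : (R_v, j) \triangleleft (R_u, i) \;\wedge\; \mathit{Map}(\mathcal{L}_y(F_v(j)), e, s)) \;\wedge \\
& B(\mathcal{L}_x(F_u(i))) \notin C_{in}(s) \qquad (7)
\end{aligned}
$$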
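The two definitions can be checked against a conventional simulation on a small trace. The sketch below is only an illustration under stated assumptions: a trace is a list of memory block numbers in execution order, block `b` maps to set `b % num_sets`, and the function names are hypothetical. It evaluates the interior and boundary conditions literally (by exhaustive search rather than Presburger simplification) and compares the result with a direct-mapped simulator.

```python
def classify_by_definition(trace, num_sets, initial=None):
    """Return (interior, boundary) miss positions for a direct-mapped cache.

    `initial` maps a set index to the block resident before the loop nest.
    """
    initial = initial or {}
    interior, boundary = set(), set()
    for t, b in enumerate(trace):
        s = b % num_sets
        earlier_same_set = [p for p in range(t) if trace[p] % num_sets == s]
        # Interior: some earlier access to a different block of set s, with
        # no access to b strictly between it and the current access.
        if any(trace[p] != b and b not in trace[p + 1:t]
               for p in earlier_same_set):
            interior.add(t)
        # Boundary: no earlier access maps to set s, and b is not resident
        # in the initial state of that set.
        if not earlier_same_set and initial.get(s) != b:
            boundary.add(t)
    return interior, boundary

def simulate(trace, num_sets, initial=None):
    """Reference direct-mapped simulation; returns the missing positions."""
    cache = dict(initial or {})
    misses = set()
    for t, b in enumerate(trace):
        s = b % num_sets
        if cache.get(s) != b:
            misses.add(t)
        cache[s] = b
    return misses
```

On the trace `[0, 4, 0, 1, 5, 1, 0]` with four sets and an empty initial cache, the two classes are disjoint and their union is exactly the set of misses reported by the simulator.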

3.2.3  Cache  state 

If the loop nest $L$ contains no memory access mapping to set $s$, the
final cache state of set $s$, $C_{out}(s)$, is the same as the initial
cache state $C_{in}(s)$. Otherwise, the final cache state of set $s$ is
the address of the memory block that is not subsequently replaced by
an access to a block of memory mapping to the same cache set $s$.


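$$
\begin{aligned}
(C_{out} = \ast(L, C_{in})) \;\stackrel{\mathrm{def}}{=}\;& \forall s \in [0, S-1] : \bigl(\exists i : i \in I \;\wedge \\
& \quad (\exists d : \mathit{Map}(\mathcal{L}_x(F_u(i)), d, s) \;\wedge \\
& \quad\; \neg(\exists e, j, v : (R_u, i) \triangleleft (R_v, j) \;\wedge\; \mathit{Map}(\mathcal{L}_y(F_v(j)), e, s)) \;\wedge \\
& \quad\; C_{out}(s) = B(\mathcal{L}_x(F_u(i))))\bigr) \;\vee \\
& \bigl(\neg(\exists e : \mathit{Map}(\mathcal{L}_x(F_u(i)), e, s)) \;\wedge\; C_{out}(s) = C_{in}(s)\bigr) \qquad (8)
\end{aligned}
$$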

3.3  Counting  cache  misses 

We use the Omega Calculator [33, 34] to simplify the formulas above by
manipulating integer tuple relations and sets. After simplification,
we are left with formulas defining a union of polytopes (see Figure 4
for an example). The number of integer points in this union is the
number of misses. We use PolyLib [42] to operate on such unions. We
first convert the union into a disjoint union of polytopes, and then
use Ehrhart polynomials to count the number of integer points [18] in
each polytope.
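To make the counting step concrete, the following Python sketch counts integer points by brute-force enumeration. It does not use the Omega Calculator or PolyLib; the function names and the two example regions are assumptions chosen for illustration. It shows (i) counting a union through a disjoint decomposition, and (ii) a case where the count is a polynomial in the size parameter, which is the kind of closed form an Ehrhart polynomial provides.

```python
from itertools import product

def count_points(constraints, box):
    """Count integer points of `box` that satisfy every constraint."""
    return sum(all(c(p) for c in constraints) for p in product(*box))

def count_union_disjoint(P, Q, box):
    """Count |P u Q| as |P| + |Q minus P|, so no point is counted twice."""
    in_P = lambda p: all(c(p) for c in P)
    in_Q = lambda p: all(c(p) for c in Q)
    only_Q = sum(in_Q(p) and not in_P(p) for p in product(*box))
    return count_points(P, box) + only_Q

def triangle_count(n):
    """Closed form for the triangle 0 <= j <= i <= n-1: n(n+1)/2 points."""
    return n * (n + 1) // 2
```

For example, with `box = [range(10), range(10)]`, the triangle `j <= i` holds 55 points, the triangle `i + j <= 9` holds 55 points, and their union holds 80 points. Evaluating `triangle_count(n)` costs the same for every `n`, whereas enumeration grows with the trip count.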

3.4  Extension  to  imperfect  loop  nests 

Extending  our  model  to  imperfect  loop  nests  involves  two  steps. 

1.  We  use  the  transformations  of  Ahmed  et  al.  [2,  3]  to  convert 
an  imperfect  loop  nest  into  a  perfect  loop  nest  with  guards  on 
statements. 

2.  We  extend  the  notion  of  a  valid  iteration  point  to  that  of  a 
valid  access. 

For each statement of the loop nest, Ahmed et al. define a statement
iteration space whose dimension is the number of loops that contain
the statement. The product space for the loop nest is a linearly
independent subspace of the Cartesian product of all the statement
iteration spaces. Affine embedding functions map a point in a
statement iteration space to a point in the product space. When
multiple statements map to the same iteration point in product space,
they are executed in program order. In relation to the product space,
embeddings represent guards on statements, mapping a statement from
its place outside the innermost loop to a valid place inside the
innermost loop. We emphasize that the guards are conceptual, and for
analysis only. They do not result in run-time conditional tests in the
generated code.

Kelly and Pugh [35, 36] and Lim and Lam [41] have presented other
algorithms that embed imperfect loop nests into perfect loop nests,
with similar end results. The details of the embedding algorithms are
not important for our purpose. Our use of the framework of Ahmed et
al. merely reflects our greater familiarity with their work.

Figure 1(a) is an improved version of Example 1, in which the
loop-invariant reference C[i,j] is hoisted out of the k-loop and
stored in a scalar x that can be register-resident. In this imperfect
loop nest, statements S0 and S2 occur outside of the innermost loop.
Let $i_X$ denote the loop index variable i pertaining to statement SX.
Then $i_0 \times j_0$ and $i_2 \times j_2$ are the statement iteration
spaces of statements S0 and S2, respectively. The following embedding
functions


$$[i_0, j_0]^T \mapsto [i_0, j_0, 0]^T, \qquad [i_2, j_2]^T \mapsto [i_2, j_2, n-1]^T$$

map points in these statement iteration spaces to points in product
space $[i, j, k]^T$. It is clear how the guards on statements S0 and
S2 of Figure 1(b) accomplish this. Statement S1 is already in the
innermost loop, and requires no guard on it.


The second part of the extension is to ensure that our model can
handle array references that are guarded in this manner. We accomplish
this effect by extending our notion of a valid iteration point
(Section 3.1) to that of a valid access.

Let $R_u = (Y^{(x)}, F_u, S_h)$ be the reference with $0 \le u < k$.
Let $G_h(i)$ be the guard of statement $S_h$ in the product space
version of the loop nest. We assume that the guards are expressible in
Presburger arithmetic. For Figure 1(b), $G_0 = (i_2 = 0)$,
$G_1 = \mathit{true}$, and $G_2 = (i_2 = n_2 - 1)$. Then
$(R_u = (Y^{(x)}, F_u, S_h), i)$ is a valid access if $i$ belongs to
the iteration space, and $G_h(i)$ holds. The predicate
$(R_u, i) \in \mathcal{I}$ represents this fact.

$$(R_u, i) \in \mathcal{I} \;\stackrel{\mathrm{def}}{=}\; i \in I \;\wedge\; 0 \le u < k \;\wedge\; G_h(i) \qquad (9)$$

With this extension, the formulas from Section 3.2 apply directly,
with every occurrence of $i \in I$ replaced by
$(R_u, i) \in \mathcal{I}$.
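The effect of the guards can be illustrated by enumerating the valid statement instances of the product-space loop nest of Figure 1(b). This Python sketch is an assumption-laden illustration: it works at the granularity of statements rather than individual references, the generator name is hypothetical, and the guards are written as ordinary predicates instead of Presburger formulas.

```python
from itertools import product

def valid_instances(n):
    """Yield (statement, (i, j, k)) pairs that pass the guards of Figure 1(b)."""
    guards = {
        "S0": lambda i, j, k: k == 0,       # x = C[i,j], before the k-loop
        "S1": lambda i, j, k: True,         # body of the k-loop, unguarded
        "S2": lambda i, j, k: k == n - 1,   # C[i,j] = x, after the k-loop
    }
    # Iteration points in lexicographic order; statements in program order.
    for point in product(range(n), repeat=3):
        for stmt in ("S0", "S1", "S2"):
            if guards[stmt](*point):
                yield stmt, point
```

For n = 4 this yields 16 instances of S0, 64 of S1, and 16 of S2, which matches the trip counts of the imperfect loop nest in Figure 1(a).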

3.5  Associativity 

We currently handle associativity in a straightforward manner,
assuming a Least Recently Used replacement policy. From Section 3.2.1,
we simply need to allow at least $A$ distinct accesses preceding
$(R_u, i)$ to unique memory blocks, such that there is no access
$(R_w, k)$ accessing the same memory block as $(R_u, i)$ and
$(R_{v_0}, j_0) \triangleleft (R_w, k) \triangleleft (R_u, i)$ (where
$(R_{v_0}, j_0)$ is the earliest of at least $A$ references to unique
memory blocks). The following Presburger formula expresses interior
misses for an $A$-way set-associative cache.

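$$
\begin{aligned}
((R_u, i) \in \mathit{IntMiss}) \;\stackrel{\mathrm{def}}{=}\;& i \in I \;\wedge \\
& \exists d, s : \mathit{Map}(\mathcal{L}_x(F_u(i)), d, s) \;\wedge \\
& \exists e_0, j_0, v_0 : (R_{v_0}, j_0) \triangleleft (R_u, i) \;\wedge\; \mathit{Map}(\mathcal{L}_{y_0}(F_{v_0}(j_0)), e_0, s) \;\wedge \\
& \Bigl(\bigwedge_{a=1}^{A-1} \exists j_a, v_a : (R_{v_0}, j_0) \triangleleft (R_{v_a}, j_a) \triangleleft (R_u, i) \;\wedge\; \mathit{Map}(\mathcal{L}_{y_a}(F_{v_a}(j_a)), e_a, s)\Bigr) \;\wedge \\
& d \ne e_0 \ne \cdots \ne e_{A-1} \;\wedge \\
& \neg(\exists k, w : (R_{v_0}, j_0) \triangleleft (R_w, k) \triangleleft (R_u, i) \;\wedge\; \mathit{Map}(\mathcal{L}_z(F_w(k)), d, s)) \qquad (10)
\end{aligned}
$$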

This method will handle modest values of $A$, and the complexity of
the formulas certainly increases with $A$. Presburger formulas for
cache state and boundary misses with associativity $A$ are
non-obvious, and will require more work to construct.
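The LRU condition behind the set-associative interior-miss formula can be exercised on a small trace. The sketch below makes several assumptions for illustration: a trace is a list of block numbers, block `b` maps to set `b % num_sets`, the cache starts empty, and both function names are hypothetical. The first function is an ordinary LRU simulator; the second counts accesses for which at least `assoc` distinct other blocks of the same set were accessed since the last access to the same block, which is the situation the formula describes.

```python
from collections import OrderedDict

def lru_misses(trace, num_sets, assoc):
    """Count all misses of an initially empty `assoc`-way LRU cache."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for b in trace:
        lines = sets[b % num_sets]
        if b in lines:
            lines.move_to_end(b)            # b becomes most recently used
        else:
            misses += 1
            if len(lines) == assoc:
                lines.popitem(last=False)   # evict the least recently used
            lines[b] = True
    return misses

def interior_by_definition(trace, num_sets, assoc):
    """Count accesses preceded by >= assoc distinct conflicting blocks."""
    count = 0
    for t, b in enumerate(trace):
        s = b % num_sets
        # Position of the previous access to b, or -1 if there is none.
        last = max((p for p in range(t) if trace[p] == b), default=-1)
        others = {trace[p] for p in range(last + 1, t)
                  if trace[p] % num_sets == s}
        count += len(others) >= assoc
    return count
```

On the trace `[0, 4, 0, 8, 4]` with four sets (all blocks map to set 0), a two-way cache takes 4 misses, of which 2 satisfy the interior condition; the other 2 are first-time fills of the initially empty set.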

3.6  Array  layouts  based  on  bit  interleaving 

Previous work [14, 15, 21] suggests that non-linear data layouts
provide better cache performance than canonical layout functions in
some numerical codes. Such layout functions are described in terms of
interleavings of the bits in the binary expansions of the array
coordinates rather than as affine functions of the numerical values of
these quantities. We describe such bit interleavings and provide
formulations of these layouts in Presburger arithmetic.

In developing the model of alternative array layouts, we assume that
$n_j = 2^{q_j}$ for each $j \in [0, d_x - 1]$ (where $d_x$ is the
number of coordinates in an array $Y^{(x)}$). Therefore, the bit
representation of an array index will have $q_j$ bits, with the least
significant bit (LSB) numbered 0 and the most significant bit (MSB)
numbered $q_j - 1$. We identify the binary sequence
$s_{q-1} \ldots s_0$ with the non-negative integer
$s = \sum_{i=0}^{q-1} s_i 2^i$. We denote by $B_{q_j}$ the set of all
binary sequences of length $q_j$, and extend the above identification
to identify $B_{q_j}$ with the interval $[0, 2^{q_j} - 1]$.

(a)
        do i = 0, n-1
          do j = 0, n-1
    S0:     x = C[i,j]
            do k = 0, n-1
    S1:       x = A[i,k]*B[k,j] + x
            end
    S2:     C[i,j] = x
          end
        end

(b)
        do i = 0, n-1
          do j = 0, n-1
            do k = 0, n-1
    S0:       if (k == 0) x = C[i,j]
    S1:       x = A[i,k]*B[k,j] + x
    S2:       if (k == n-1) C[i,j] = x
            end
          end
        end

Figure 1: (a) An imperfect loop nest for matrix multiplication. (b) The product space version with guards.

We describe a family of nonlinear layout functions parameterized by a
single parameter $\sigma$, as follows. A
$(q_0, \ldots, q_{d_x-1})$-interleaving, $\sigma$, is a sequence of
length $p$ (where $p = \sum_{i=0}^{d_x-1} q_i$) over the alphabet
$\{0, \ldots, d_x - 1\}$ containing $q_i$ $i$'s. It describes the
order in which bits from the $d_x$ array coordinates are interleaved
to linearize the array in memory.

An array layout functions as a map from $d_x$ array coordinates to a
memory address. Therefore, given a
$(q_0, \ldots, q_{d_x-1})$-interleaving $\sigma$, define a map

$$\Theta : B_{q_0} \times \cdots \times B_{q_{d_x-1}} \to B_p$$

in the following way. If
$x^{(i)} = x^{(i)}_{q_i-1} \ldots x^{(i)}_0 \in B_{q_i}$,
$i \in [0, d_x - 1]$, then $\Theta(x^{(0)}, \ldots, x^{(d_x-1)})$ is
the sequence obtained by replacing the $j$th $i$ from the right in
$\sigma$ (counting from $j = 0$) with $x^{(i)}_j$. We extend this
notation to consider $\Theta$ as a map from
$[0, 2^{q_0} - 1] \times \cdots \times [0, 2^{q_{d_x-1}} - 1]$ to
$[0, 2^p - 1]$ by identifying non-negative integers and their binary
expansions. We call $\Theta$ the mixing function indexed by $\sigma$.
Note that $\Theta(0, \ldots, 0) = 0$ for any $\sigma$.

Example 2 Let $d_x = 2$, $n_0 = 16$ ($q_0 = 4$), $n_1 = 16$
($q_1 = 4$), and $\sigma = 10110010$. Then

$$\Theta(12, 5) = \Theta(1100, 0101) = 01101010 = 106.$$

Example 3 Let $d_x = 3$, $n_0 = 8$ ($q_0 = 3$), $n_1 = 8$ ($q_1 = 3$),
$n_2 = 4$ ($q_2 = 2$), and let $\sigma = 21102001$. Then

$$\Theta(3, 7, 1) = \Theta(011, 111, 01) = 01101111 = 111.$$
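The mixing function is easy to evaluate mechanically. The following Python sketch is one possible rendering, not the paper's implementation; the function name `mix` and the representation of the interleaving as a string of digits, most significant position first, are assumptions made for this example.

```python
def mix(sigma, coords):
    """Interleave the bits of `coords` in the order prescribed by `sigma`.

    `sigma` is a string over the digits 0..d-1, written most significant
    position first; `coords` holds the d non-negative integer coordinates.
    """
    taken = [0] * len(coords)   # bits already consumed from each coordinate
    address = 0
    for pos, symbol in enumerate(reversed(sigma)):   # walk sigma from the right
        c = int(symbol)
        bit = (coords[c] >> taken[c]) & 1   # next unused bit of coordinate c
        taken[c] += 1
        address |= bit << pos
    return address
```

With this encoding, `mix("10110010", (12, 5))` evaluates to 106 and `mix("21102001", (3, 7, 1))` evaluates to 111, in agreement with Examples 2 and 3.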

The idea behind translating such a data layout into a Presburger
formula is to define the bit values of the binary expansion of the
memory address using Presburger arithmetic. Consider again reference
$R_u = (Y^{(x)}, F_u, S_h)$ and iteration point $\ell$. For every
$n_j$ where $0 \le j < d_x$, let $n_j = 2^{q_j}$. Then $\sigma$ is a
$(q_0, \ldots, q_{d_x-1})$-interleaving. Then we can compute the
following $d_x \times p$ matrix $Z(\sigma)$. Letting $g = \sigma_f$,
the $f$th column of $Z(\sigma)$ consists of $2^e$ in the $g$th
position, where $\sigma_f$ is the $e$th $g$ from the right, and zeros
in every other position. $Z(\sigma)$ can be thought of as a
transformation that, when applied to the binary expansion of a memory
address $m$, produces the coordinates of the array element at $m$.

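Example 4 Given that $d_x = 3$, $n_0 = 8$ ($q_0 = 3$), $n_1 = 8$
($q_1 = 3$), $n_2 = 4$ ($q_2 = 2$), and $\sigma = 12102010$,

$$
Z(\sigma) = \begin{bmatrix}
0 & 0 & 0 & 4 & 0 & 2 & 0 & 1 \\
4 & 0 & 2 & 0 & 0 & 0 & 1 & 0 \\
0 & 2 & 0 & 0 & 1 & 0 & 0 & 0
\end{bmatrix}
$$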
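The construction of the matrix can be sketched in a few lines of Python. The names `z_matrix` and `coordinates`, and the encoding of the interleaving as a digit string with the most significant position first, are assumptions made for this illustration. Columns are indexed left to right, so column f corresponds to bit $m_{p-1-f}$ of the address offset.

```python
def z_matrix(sigma, d):
    """Build the d x p matrix Z(sigma) as a list of rows."""
    p = len(sigma)
    Z = [[0] * p for _ in range(d)]
    seen = [0] * d   # occurrences of each symbol met so far, from the right
    for f in range(p - 1, -1, -1):
        g = int(sigma[f])
        Z[g][f] = 2 ** seen[g]   # sigma[f] is the seen[g]-th g from the right
        seen[g] += 1
    return Z

def coordinates(Z, offset):
    """Apply Z to the bit vector [m_{p-1}, ..., m_0] of `offset`."""
    p = len(Z[0])
    bits = [(offset >> (p - 1 - f)) & 1 for f in range(p)]
    return [sum(z * m for z, m in zip(row, bits)) for row in Z]
```

`z_matrix("12102010", 3)` reproduces the three rows of Example 4, and applying the matrix for the interleaving 21102001 to the offset 111 recovers the coordinates (3, 7, 1).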


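The following formula maps $F_u(\ell)$, $\mu_x$, and $Z(\sigma)$ to
memory location $m$. Let $p = \sum_{j=0}^{d_x-1} q_j$, and let
$M = [m_{p-1}, \ldots, m_0]^T$.

$$
\begin{aligned}
(m = \mathit{Interleave}(F_u(\ell), \mu_x, Z(\sigma))) \;\stackrel{\mathrm{def}}{=}\;& \exists m_{\mathit{off}}, m_{p-1}, \ldots, m_0 : \\
& 0 \le m_{p-1}, \ldots, m_0 \le 1 \;\wedge\; m \ge 0 \;\wedge \\
& m = \mu_x + m_{\mathit{off}}\,\beta_x \;\wedge\; F_u(\ell) = Z(\sigma) M \;\wedge \\
& m_{\mathit{off}} = \sum_{k=0}^{p-1} m_k 2^k \qquad (11)
\end{aligned}
$$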

Data layouts such as X-Morton and U-Morton [15] require an XOR
operation in addition to bit interleaving. (Note that this formalism
applies only to $n \times n$ arrays.) The additional XOR operation can
also be expressed as a Presburger formula on the bit representation.

3.7  Physically  addressed  caches 

The techniques described thus far operate on virtual addresses.
However, many systems use physically indexed caches (e.g.,
second-level caches) whose performance is highly dependent on page
placement. Fortunately, most operating systems employ page coloring
techniques that minimize this effect [37] by creating virtual to
physical page mappings such that the virtual and physical cache
indices are identical. It may also be possible to extend our analysis
to include the effects of page placement; we leave this as future
research.

4.  RESULTS 

In this section, we present and interpret cache behavior as obtained
by our method on five model problems, and validate the results against
cache miss counts produced by a (specially-written) cache simulator.
Unless otherwise specified, we use a direct-mapped cache with capacity
4096 bytes and block size of 32 bytes that is initially empty. We
assume that all data arrays contain double-precision numbers (so that
$\beta$ is eight bytes), and that all arrays are linearized in
column-major order. The total number of misses for each array matches
exactly between our model and the simulator in all cases, but the
partitioning of the misses differs. We explain the implications of
this difference in Section 4.1.

Problem 1 (Matrix multiplication) We count boundary and interior
misses for each array for the matrix multiplication kernel shown in
Example 1, under four scenarios.

1. Problem size n = 21, the leading dimension of each array is n, and
the three arrays are adjacent to each other in memory address space
(i.e., $\mu_A = 0$, $\mu_B = \beta n^2$, and $\mu_C = 2\beta n^2$). We
show results for all six possible permutations of the loop orders,
from both our approach and from explicit cache simulation. This is
representative of a code where both the iteration space and the data
arrays are tiled. Placing the arrays back-to-back causes two memory
blocks to be shared between arrays. Figure 2(a) tabulates the results.
The jki loop order is seen to be substantially superior in terms of
total misses.

2. Problem size n = 20, the leading dimension of each array is n, and
the three arrays are adjacent to each other in memory address space.
We show results for all six possible permutations of the loop orders,
from both our approach and from explicit cache simulation. This
scenario is similar to the previous one, but there is no sharing of
memory blocks between arrays. Figure 2(b) tabulates the results. The
number of misses is somewhat smaller, and the jki loop order wins
again.

3. Problem size n = 21, the leading dimension of each array is n, and
the three arrays collide in cache space (i.e., $\mu_A = 0$,
$\mu_B = 4096$, and $\mu_C = 8192$). This represents a situation where
the arrays do not use the cache effectively (occupying only 111 of the
128 cache sets). We show results for all six possible permutations of
the loop orders, from both our approach and from explicit cache
simulation. Figure 2(c) tabulates the results. The number of misses
rises dramatically, as expected; the jki loop order produces the
fewest cache misses, but not by as large a margin.

4. Problem size n = 20, the leading dimension of each array is kn (for
$k \in \{1, 2, 3\}$), and the three arrays are adjacent to each other
in memory address space. This represents a situation where the
iteration space is tiled but the data is not reorganized, resulting in
the data tiles not being contiguous in memory space. We show only the
ijk loop order. Figure 2(d) tabulates the results. The total number of
misses for each array changes with the leading dimension, although
different arrays behave differently.


Problem 2 (Multiple loop nests) We count boundary and interior misses
for each array for the following variation on the matrix
multiplication kernel.

    do i = 0, n-1        /* Loop nest 1 */
      do j = 0, n-1
        C[i,j] = 0
      end
    end

    do i = 0, n-1        /* Loop nest 2 */
      do j = 0, n-1
        do k = 0, n-1
          C[i,j] = A[i,k]*B[k,j] + C[i,j]
        end
      end
    end

The layout constraints are identical to those in Problem 1,
scenario 1. This demonstrates how the model handles multiple loop
nests.

The miss counts are as follows.

            A                  B                  C
    Loop    Bnd  Int  Tot      Bnd  Int  Tot      Bnd  Int  Tot
    1         0    0    0        0    0    0      111    0  111
    2        28  521  549       92  866  958        0  383  383

The model correctly classifies all the misses in the first loop nest
as boundary misses. The cache contains all of array C at the end of
the first loop nest, so all of the misses of C in the second loop nest
are interior misses. Figure 3 graphically represents cache state at
the end of the computation.

Problem 3 (Imperfect loop nest) We count boundary and interior misses
for each array for the imperfect loop version of the matrix
multiplication kernel of Figure 1 with n = 21, with the leading
dimension of each array being n. This demonstrates how the model
handles imperfect loop nests. We show two scenarios.

The first scenario has the three arrays adjacent to each other in
memory address space. The miss counts are as follows.

               A     B    C (read)  C (write)
    Bnd       28    92       8          0
    Int      521   866     383          0
    Total    549   958     391          0
    Cold     110   110     111          0
    Repl     439   848     280          0

The significant observation is that none of the write references to C
miss, even though there are many references to A and B between the
read and the write reference to C[i,j]. The total number of misses is
identical to that of Problem 1, scenario 1.

The second scenario has the arrays colliding in the cache. The miss
counts are as follows.

               A     B    C (read)  C (write)
    Bnd       20    90       1          0
    Int      980   648     440        441
    Total   1000   738     441        441
    Cold     111   111     111          0
    Repl     889   627     330        441

Now every read and write reference to C[i,j] misses. However, the
total number of cache misses is significantly smaller than the
corresponding case in Problem 1, scenario 3, showing the benefit of
allocating C[i,j] in a register.

Problem 4 (Set-associative cache) We count interior misses for each
array for the matrix multiplication kernel shown in Example 1, using
two-way associative caches. The layout constraints are identical to
those in Problem 1, scenario 2. This demonstrates how the model
handles associativity.

Both scenarios consider a two-way associative cache with block size of
32 bytes that is initially empty. The cache has a capacity of 4096
bytes in the first scenario and 8192 bytes in the second scenario. The
miss counts are as follows.

             C = 4096             C = 8192
               A     B     C        A     B     C
    Bnd           128                  256
    Int       75   773   213        8     0    36
    Total        1189                  300
    Cold     100   100   100      100   100   100
    Repl       0   757   132        0     0     0

The total number of boundary misses in each scenario is determined by
the number of cache frames in the footprint of all three arrays in
cache. For every cache frame that is touched during the matrix
multiplication kernel, the first instance of a memory block being
mapped to the cache frame incurs a boundary miss since the cache is
initially empty. In the first scenario, there are 64 cache sets. Since
the cache footprint of all arrays ‘wraps around’ at least twice, we
know that all 128 cache frames are touched. Hence, there are 128
boundary misses in scenario 1. Similarly, we can determine that there
are 256 boundary misses in scenario 2.

(a)
    Loop    A                             B                             C                             Grand
    order   Bnd  Int  Tot  Cold  Repl     Bnd  Int   Tot   Cold  Repl   Bnd  Int   Tot   Cold  Repl   Total
    ijk      28  521  549   110   439      92   866   958   110   848     8   383   391   111   280    1898
    ikj      18  445  463   110   353      85  1985  2070   110  1960    25  1563  1588   111  1477    4121
    jik     108  590  698   110   588      18   502   520   110   410     2   109   111   111     0    1329
    jki     104  355  459   110   349      18   167   185   110    75     6   207   213   111   102     857
    kij       2  184  186   110    76      34  1644  1678   110  1568    92  1624  1716   111  1605    3580
    kji       9  297  306   110   196      31   436   467   110   357    88   530   618   111   507    1391

[Panels (b)-(d): table entries not recoverable.]

Figure 2: Miss counts from our approach (Bnd and Int) and from cache simulation (Cold and Repl). (a) Problem 1, scenario 1. (b) Problem 1, scenario 2. (c) Problem 1, scenario 3. (d) Problem 1, scenario 4.

Figure 3: Cache state at the end of the computation described in Problem 2. The shaded blocks are cache-resident. There are exactly 128 shaded memory blocks. Arrays A and B share a block, as do arrays B and C. The block with the heavy outline in each array maps to cache set 0.

Problem 5 (Symbolic analysis) We analyze the matrix-vector product
example from Fricker et al. [22] to show the symbolic processing
capabilities of our approach. The code is as follows.

    do j1 = 0, N-1
      reg = Y[j1]
      do j2 = 0, N-1
        reg += A[j2,j1] * X[j2]
      end
      Y[j1] = reg
    end

We focus on the interior misses on X due to interference from A,
assuming that $\mu_A = 0$ but that $\mu_X$ is symbolic. For
compatibility with Fricker et al., we use a direct-mapped cache of
capacity 8192 bytes and block size of 32 bytes, and we choose
N = 100. The formula shown in Figure 4 is a pretty-printed version of
formula (6) as simplified by the Omega Calculator. While the formula
appears formidable, it should be kept in mind that it captures the
miss patterns for all possible values of $\mu_X$. In principle, the
Ehrhart polynomial of the polytope union represented by this formula
can be computed, enabling counting of the number of misses for a
particular value of $\mu_X$ by evaluating this polynomial.

4.1  Interpretation  of  results 

Two general observations on the results are worth mentioning.

First, the model results are identical to the simulation results in
all cases. This reinforces the exactness of the model, which is a
major strength.

Second, the classification of cache misses into boundary and interior
misses rather than cold and replacement misses is a significant
departure from previous models. Boundary misses are, in a sense, a
cache-centric analog of cold misses. Just as the number of cold misses
of a program is no more than the number of memory blocks occupied by
the data, the number of boundary misses is no more than the combined
cache footprint of the data, which is itself bounded above by the
number of cache sets. This can be verified by totaling the boundary or
cold misses in any row of Figure 2(b) (where the totals are 128 and
300, respectively) or Figure 2(c) (where the totals are 111 and 333,
respectively). In general, then, the number of boundary misses should
be significantly smaller than the number of cold misses. Thus, our
classification allows for more precise context-free identification of
misses, leaving many fewer references to be resolved from cache state.
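The footprint bounds can be reproduced with a small direct-mapped simulator that totals misses under both classifications. This Python sketch is an illustration under stated assumptions, not the simulator used for the experiments: the function name is hypothetical, only the ijk loop order is simulated, and the order of the four references within the statement (A, B, C read, C write) is assumed. The boundary and cold totals do not depend on that order, because each counts first touches (of a cache set and of a memory block, respectively) in an initially empty cache.

```python
def classify(n, bases, capacity=4096, block=32, beta=8):
    """Simulate ijk matrix multiplication on an empty direct-mapped cache."""
    num_sets = capacity // block
    # Column-major address with leading dimension n.
    addr = lambda base, r, c: base + (r + n * c) * beta
    cache, touched_sets, seen_blocks = {}, set(), set()
    counts = {"Bnd": 0, "Int": 0, "Cold": 0, "Repl": 0}
    a, b, c = bases
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # Assumed reference order: A, B, C (read), C (write).
                for address in (addr(a, i, k), addr(b, k, j),
                                addr(c, i, j), addr(c, i, j)):
                    blk = address // block
                    s = blk % num_sets
                    if cache.get(s) != blk:   # a miss
                        counts["Bnd" if s not in touched_sets else "Int"] += 1
                        counts["Cold" if blk not in seen_blocks else "Repl"] += 1
                    cache[s] = blk
                    touched_sets.add(s)
                    seen_blocks.add(blk)
    return counts
```

For n = 20 with adjacent arrays, `classify(20, (0, 3200, 6400))` reports 128 boundary misses and 300 cold misses; for n = 21 with colliding arrays, `classify(21, (0, 4096, 8192))` reports 111 and 333. In both cases the boundary total equals the number of cache sets touched, while the cold total equals the number of distinct memory blocks touched.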

4.2  Running  times 

Figure 5 shows a histogram of the running times required by the Omega
Calculator [33] to simplify 108 sample cache miss formulas on a
300 MHz Sparc Ultra 60. The samples are made up of boundary and
interior miss formulas for each array in scenarios 1, 2, and 3 of
Problem 1. The cache miss formulas are generated from source code
using a SUIF [55] compiler pass that we developed for this purpose.
The time required for formula generation is negligible.

Half of the cache miss formulas run in less than 10 seconds, and the
majority of those formulas run in less than 1 second. The boundary
miss formulas simplified quickly (most in under a second), while the
time required to simplify the interior miss formulas varied widely. We
have observed the running time of a formula to be strongly correlated
to the number of cache misses it generates.

Given these running times, our approach clearly does not yet have
enough performance to be practical; however, the value of insight
gained from our approach should not be overlooked. Furthermore, it is
not clear how much of the slow running time is a consequence of our
formulations and how much is due to the Omega software. We hope to
make this determination in the immediate future, by investigating
other software options.

Figure 5: Histogram of running times of formula simplification using the Omega Calculator on a 300 MHz Sparc Ultra 60.

5.  RELATED  WORK 

We organize related work by the way in which it handles cache
behavior: compiler-centric, language-centric, architecture-centric, or
trace-centric (including simulation).

Compiler-centric. The work of Ghosh et al. [23, 24, 25, 26] is most
closely related to our framework for analytical cache modeling. (Zhang
and Martonosi [64] have recently begun extending this work to pointer
data structures.) Ghosh et al. introduce additional constraints to
make the problem tractable. We avoid these constraints.

Work by Ahmed et al. on tiling imperfect loop nests [2, 3] embeds the
iteration space of each statement of a loop into a product space. We
use this transformation in Section 3.4.

Caşcaval [12] estimates the number of cache misses using stack
distances. His prediction is valid for a fully-associative cache with
LRU replacement, and requires a probabilistic argument to transfer to
a cache with smaller associativity. He assumes that each loop nest
starts with a cold cache, sacrificing the accuracy gained from knowing
the actual cache state at the start of the loop nest.

Our simplified formulas resemble LMADs [47] in several respects.
Establishing a connection between them remains the subject of future
work.

Language-centric. Alt et al. [5] apply Abstract Interpretation to
predicting the behavior of instruction caches, for general programs.
Their notion of cache state is somewhat different from ours.

Prior empirical evidence [14, 15, 21] suggests that alternative array
layout functions provide better cache behavior than canonical layout
functions for many dense linear algebra codes. Previous work [27] has
taken a combinatorial approach to modeling cache misses in the
presence of such non-linear data layouts.

Architecture-centric. Lam, Rothberg, and Wolf [38] discuss the
importance of cache optimizations for blocked algorithms, using matrix
multiplication as an example. Their simulation-based analysis is
exact, but their performance models are approximate.

Fricker et al. [22] develop a model for approximating cache
interferences incurred while executing blocked matrix vector multiply
in a specific cache. Their analysis is inexact in considering only
cross-interferences and neglecting redundancies among array pairs.

McKinley and Temam [44] examine locality characteristics in the
SPEC'95 and Perfect Benchmarks. Their discovery most pertinent to our
work is that most misses are internest capacity misses.

[Figure 4: formula given as a union of sets of 4-tuples; the constraints are not recoverable.]

Figure 4: A formula describing interior misses on X due to interference from A in Problem 5. Each 4-tuple is of the form $[j_1, j_2, s, d]$, where $(j_1, j_2)$ identifies the iteration at which the miss occurs, and $(s, d)$ identifies the set and the cache wraparound of the reference.

Harper et al. [28] present an analytical model that focuses on
set-associative caches. Their model approximates the cache miss
ratio of a looping construct and allows imperfect loop nests to be
considered. They do not attempt to analyze multiple loop nests.

Trace-centric. Prior research [1, 57] has investigated various
analytic cache models by extracting parameters from the reference
trace. Simulation techniques, such as cache profiling [43, 39], can
provide insight on potential program transformations by classifying
misses according to the cause of the cache miss. Trace-centric
approaches usually require full execution of the program.

Weikle  et  al.  [58,  59]  introduce  the  novel  idea  of  viewing  caches 
as  filters.  This  framework  is  not  limited  to  analyzing  loop  nests  or 
other  particular  program  constructs,  but  can  handle  any  pattern  of 
memory  references.  Brehob  and  Enbody  [11]  model  locality  using 
distances  between  memory  references  in  a  trace. 

Wood et al. [63] explore the problem of resolving unknown
references in simulation (first-time references to memory blocks that
may miss or hit depending on the cache state at the beginning of
the trace sample) and show that accurate estimation of their miss
rate is necessary. We use cache state to resolve such unknown
references, and then categorize them as boundary misses or hits.
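
The following minimal sketch illustrates how a known initial cache state resolves first-time references. It assumes a direct-mapped cache and assumes that a miss is "interior" when the set was already touched earlier in the same trace and "boundary" when the outcome depended on the initial state. The function name `classify_accesses` and its parameters are hypothetical and are not taken from our implementation.

```python
def classify_accesses(trace, initial_state, num_sets, block_size):
    """Classify each access of a trace on a direct-mapped cache.

    trace         -- iterable of byte addresses
    initial_state -- dict: set index -> memory block resident at the start
                     (a missing key means the set is empty)
    Returns a list of (address, outcome) pairs, where outcome is
    'hit', 'boundary miss', or 'interior miss'.
    """
    state = dict(initial_state)   # current contents of each cache set
    touched = set()               # sets already accessed by this trace
    results = []
    for addr in trace:
        block = addr // block_size
        s = block % num_sets
        if state.get(s) == block:
            outcome = "hit"
        elif s in touched:
            # Evicted by an earlier access in this trace: the miss occurs
            # regardless of the initial cache state.
            outcome = "interior miss"
        else:
            # First access to this set: the outcome was decided by the
            # initial cache state.
            outcome = "boundary miss"
        state[s] = block
        touched.add(s)
        results.append((addr, outcome))
    return results
```

With an empty `initial_state`, every first access to a set is reported as a boundary miss; supplying the actual state turns some of those into hits.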

6.  CONCLUSIONS  AND  FUTURE  WORK 

This work began from the intuition that the CME formulation
of Ghosh et al. was not fully exploiting all of the regularity
inherent in the problem. The output of the Presburger formulas
vividly illustrates this regularity, allowing us to employ
general-purpose tools for counting misses.

While powerful mathematical results (such as the existence of
the Ehrhart polynomial) are known for polytopes, the corresponding
algorithms are complex and subject to geometric degeneracies.
As a result, the software libraries are not very robust. Such
degeneracies have prevented us, for example, from calculating the
Ehrhart polynomial for the formula in Figure 4. Similar comments
apply, with less severity, to the Presburger decision procedures. The
robustness of both libraries needs to be improved substantially to
realize the full potential of our approach.
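
To make concrete what an Ehrhart polynomial provides, the sketch below counts the integer points of a small triangular region both by enumeration and by a closed-form polynomial in the size parameter. The triangle 0 <= j <= i <= n is an assumed toy example chosen for illustration; it is not one of the miss formulas of this paper, and the function names are hypothetical.

```python
def count_by_enumeration(n):
    """Count integer points (i, j) with 0 <= j <= i <= n, one at a time."""
    return sum(1 for i in range(n + 1) for j in range(i + 1))


def count_by_polynomial(n):
    """Closed-form count for the same triangle: (n + 1)(n + 2) / 2.

    This is the Ehrhart polynomial of the triangle with vertices
    (0, 0), (1, 0), (1, 1), evaluated at dilation factor n.
    """
    return (n + 1) * (n + 2) // 2


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, count_by_enumeration(n), count_by_polynomial(n))
```

The enumeration cost grows with the number of points, whereas the polynomial is evaluated in constant time. This is the sense in which counting cost can follow the static structure of a loop nest rather than its trip count.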


While we have made some progress in handling associativity,
symbolic constants, and non-linear array layouts, much remains to
be done on all three fronts. Our current handling of associativity is
incomplete and unscalable; in particular, it is not powerful enough
to model TLB behavior. Our ability to handle symbolic constants
derives from, and is therefore limited by, the corresponding
capability in Omega. The constraints introduced in Section 3.6 to
handle non-linear data layouts are essentially 0-1 integer programming
constraints, which are likely to cause bad behavior in the Presburger
decision procedures.
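
The sketch below suggests why bit-interleaved layouts lead to 0-1 constraints. It assumes a Morton-style (Z-order) layout in which bit k of the column index j lands at position 2k and bit k of the row index i at position 2k + 1. This convention and the function names are assumptions made for illustration, and the code does not reproduce the constraints of Section 3.6.

```python
def interleave_bits(i, j, bits):
    """Morton-style index built by interleaving the low `bits` bits of i and j."""
    index = 0
    for k in range(bits):
        index |= ((j >> k) & 1) << (2 * k)       # bit k of j -> position 2k
        index |= ((i >> k) & 1) << (2 * k + 1)   # bit k of i -> position 2k + 1
    return index


def bit_variables(x, bits):
    """Decompose x into 0-1 variables b_k such that x = sum(2**k * b_k)."""
    return [(x >> k) & 1 for k in range(bits)]


def interleave_linear(i, j, bits):
    """Same index, written as a linear form over the 0-1 bit variables."""
    ib = bit_variables(i, bits)
    jb = bit_variables(j, bits)
    return sum(4 ** k * (jb[k] + 2 * ib[k]) for k in range(bits))
```

The index is not linear in i and j themselves; it becomes linear only after each index is split into per-bit variables restricted to {0, 1}, which is where the integer programming flavor comes from.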

We have recently become aware of an alternative tool [10] that
claims to be more aggressive at formula simplification than Omega,
and also of an alternative approach to representing Presburger
formulas using finite automata [8, 9]. We intend to explore both these
options to try to improve the efficiency of our system. However,
the general problem of simplifying arbitrary Presburger formulas
is intrinsically difficult, no matter whether one views it from the
perspective of logic, number theory, computational geometry [53],
automata theory, or something else. In the end, the only practical
path to efficiency may involve developing specialized algorithms
that exploit some structural constraints of the kinds of formulas
that arise in our application.

In addition to compiler-related uses, our approach may also
significantly speed up cache simulators by enabling them to rapidly
leap-frog the computation over polyhedral loop nests that consume
most of the running time. The development of such a mixed-mode
cache simulator remains the subject of future work.